Introduction to Preparing Private Data for RAG

RAG Basics

Standard Large Language Models (LLMs) are "frozen" in time, limited by their training data cut-off. They cannot answer questions about your company’s internal handbook or a private video meeting from yesterday. Retrieval-Augmented Generation (RAG) bridges this gap by supplying the LLM with relevant context from your own private data.

The Multi-Step Workflow

To make private data "readable" to an LLM, we follow a specific pipeline:

Loading: Converting various formats (PDF, web pages, YouTube) into a standard document format.
Splitting: Breaking long documents into smaller, more manageable pieces ("chunks").
Embedding: Converting text chunks into numeric vectors (a mathematical representation of meaning).
Storage: Saving these vectors in a vector store (such as Chroma) for very fast similarity search.
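
The four steps above can be sketched end to end in plain Python. This is a toy illustration only: the names load, split, embed, and ToyVectorStore are hypothetical, the word-count "embedding" stands in for a real embedding model, and the in-memory store is not Chroma's API.

```python
import math
import re
from collections import Counter


def load(raw_text, source):
    # Loading: wrap raw text in a standard "document" shape (text + metadata).
    return {"text": raw_text, "metadata": {"source": source}}


def split(document):
    # Splitting: one chunk per sentence (a simplification; real splitters
    # usually target a chunk size instead).
    sentences = re.split(r"(?<=[.!?])\s+", document["text"].strip())
    return [
        {"text": s, "metadata": dict(document["metadata"])}
        for s in sentences
        if s
    ]


def embed(text):
    # Embedding: a toy word-count vector (assumption: stands in for a model).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    # Storage: keeps (vector, chunk) pairs in memory and ranks them by similarity.
    def __init__(self):
        self.rows = []

    def add(self, chunks):
        for chunk in chunks:
            self.rows.append((embed(chunk["text"]), chunk))

    def search(self, query, k=1):
        query_vec = embed(query)
        ranked = sorted(
            self.rows, key=lambda row: cosine(query_vec, row[0]), reverse=True
        )
        return [chunk for _, chunk in ranked[:k]]


handbook = load(
    "Employees accrue vacation days every month. "
    "Expense reports are due by the fifth business day.",
    source="handbook.txt",
)
store = ToyVectorStore()
store.add(split(handbook))
best = store.search("When are expense reports due?")[0]
print(best["text"])
```

Only the best-matching chunk, not the whole handbook, would then be passed to the LLM as context.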

Why Splitting Matters

LLMs have a "context window" (a limit on how much text they can process at once). If you send a 100-page PDF, it may exceed that limit, and the request will fail or the text will be truncated. We split the data into chunks so that only the most relevant information is sent to the model.
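
As a rough illustration of splitting, the sketch below cuts text into fixed-size character windows that share a few characters with their neighbours. The function name split_with_overlap is hypothetical, and the chunk_size and chunk_overlap parameters are assumptions chosen to mirror common splitter settings. Real splitters usually also try to break at paragraph or sentence boundaries.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    # Each window starts (chunk_size - chunk_overlap) characters after the
    # previous one, so neighbouring chunks share chunk_overlap characters.
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

The repeated characters at each boundary ("hij", "opq", "vwx") are the overlap: a thought that straddles a boundary still appears whole in at least one chunk, provided it fits within the overlap.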

Question 1

Why is chunk_overlap considered a critical parameter when splitting documents for RAG?

To reduce the total number of tokens used by the LLM.

To ensure that semantic context (the meaning of a thought) is not cut off at the end of a chunk.

To make the vector database store data faster.

Challenge: Preserving Context

Apply your knowledge to a real-world scenario.

You are loading a YouTube transcript for a technical lecture. You notice that the search results are confusing "Lecture 1" content with "Lecture 2."

Task

Which splitter would be best for keeping context like "Section Headers" intact?

Solution:
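MarkdownHeaderTextSplitter is the better fit when the text contains header lines: it splits at those headers and records each header in the chunk's metadata, which helps the retrieval system distinguish between different chapters or lectures. RecursiveCharacterTextSplitter tries to keep paragraphs and sentences together by splitting at natural separators first, but it does not record headers in the metadata on its own.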
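
The idea of keeping section headers in the metadata can be sketched without any library. The function split_by_headers below is a hypothetical, simplified stand-in, not the actual splitter implementation, and it assumes the transcript already contains Markdown-style header lines such as "# Lecture 1" (raw transcripts usually do not, so these would have to be added during loading).

```python
import re

HEADER = re.compile(r"^#{1,6}\s+(.*\S)\s*$")


def split_by_headers(text):
    # Start a new chunk at every header line and store that header
    # in the chunk's metadata.
    chunks = []
    header, lines = None, []
    # A trailing sentinel header flushes the final section.
    for line in text.splitlines() + ["# END"]:
        match = HEADER.match(line)
        if match is None:
            lines.append(line)
            continue
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"text": body, "metadata": {"header": header}})
        header, lines = match.group(1), []
    return chunks


transcript = (
    "# Lecture 1\n"
    "Gradient descent updates weights step by step.\n"
    "# Lecture 2\n"
    "Backpropagation computes the gradients.\n"
)
chunks = split_by_headers(transcript)

# Retrieval can now filter on metadata instead of mixing lectures.
lecture_2 = [c["text"] for c in chunks if c["metadata"]["header"] == "Lecture 2"]
print(lecture_2)
```

Because every chunk carries its header, a search can be restricted to one lecture before or after the similarity ranking.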